Counts the number of lines correctly for files with certain multibyte encodings #1211

alindeman · 2014-05-21T15:52:45Z

Currently, loc and sloc are incorrect for UTF-16 encoded files with Windows line endings. This is because data and the newline regular expression are encoded as ASCII-8BIT or UTF-8, even if the encoding is detected as something else. In UTF-16LE, a Windows newline is encoded as "\r\0\n\0". Because a \0 separates the \r and the \n, the current regular expression (\r|\n) matches twice and counts two line breaks where there is only one.

The symptom is that UTF-16LE encoded files with Windows line endings are rendered with whitespace at the bottom that doesn't actually exist because the line count is calculated as twice what it should be.
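The double count can be reproduced in a few lines of Ruby. This is a minimal sketch, not code from this PR: the sample string and the variable names are made up for illustration, and the naive pattern only mirrors the (\r|\n) expression described above.

```ruby
# Hypothetical sample: two lines separated by one Windows line ending,
# encoded as UTF-16LE the way such a file is stored on disk.
text = "one\r\ntwo".encode("UTF-16LE")

# Treat the bytes as binary, which is how the blob data is described above.
raw = text.dup.force_encoding("ASCII-8BIT")

# In UTF-16LE, "\r\n" is the byte sequence 0D 00 0A 00.
crlf_bytes = "\r\n".encode("UTF-16LE").bytes

# The naive pattern matches \r and \n separately because a NUL byte sits
# between them, so a single line break is seen as two.
naive_count = raw.split(/\r|\n/, -1).size

# Splitting after transcoding to UTF-8 sees a single line break.
decoded_count = text.encode("UTF-8").split(/\r\n|\r|\n/, -1).size

puts "naive: #{naive_count}, decoded: #{decoded_count}"
```

The naive split yields one more piece than the decoded split, which matches the extra blank lines described in the symptom.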

/cc: @github/encodings

jdennes · 2014-05-21T15:56:11Z

Thanks for the quick fix! ⚡

alindeman · 2014-05-21T16:59:43Z

> Thanks for the quick fix! ⚡

np, was a fun one to track down. I'd still like some 👀 and a 👍 from @github/encodings before I try this out.

brianmario · 2014-05-21T17:13:41Z

nice catch! 👍

alindeman · 2014-05-21T18:07:19Z

Awkwardly enough, we've made an implicit guarantee that the data exposed through lines is encoded in its original form (in practice, ASCII-8BIT or UTF-8) even if the file is detected as a different encoding. Certain other code seems to actually depend on this.

For now, I'll make a change forcing each line back to the original encoding unless someone has a better idea.

alindeman · 2014-05-21T19:29:52Z

lib/linguist/blob_helper.rb

+          encoded_newlines = ["\r\n", "\r", "\n"].
+            map { |nl| nl.encode(encoding).force_encoding(data.encoding) }
+
+          data.split(Regexp.union(encoded_newlines), -1)


New strategy that keeps each entry in lines the same encoding as data, which (for GitHub blobs) is ASCII-8BIT. Seem OK, @github/encodings?
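To see the effect of this strategy in isolation, here is a small self-contained sketch. The method name `split_lines` and the sample data are hypothetical; only the two lines inside the method come from the diff above.

```ruby
# Hypothetical wrapper around the two lines from the diff. `data` is the raw
# blob content and `encoding` is the name of the detected encoding.
def split_lines(data, encoding)
  # Encode each newline sequence in the detected encoding, then relabel the
  # resulting bytes with the encoding of `data` so the pattern and the data
  # are compatible without transcoding `data` itself.
  encoded_newlines = ["\r\n", "\r", "\n"].
    map { |nl| nl.encode(encoding).force_encoding(data.encoding) }

  data.split(Regexp.union(encoded_newlines), -1)
end

# Sample input (an assumption for this sketch): UTF-16LE bytes tagged as
# ASCII-8BIT, the way the comment above describes GitHub blobs.
data = "one\r\ntwo".encode("UTF-16LE").force_encoding("ASCII-8BIT")

lines = split_lines(data, "UTF-16LE")

# Every entry keeps the encoding of `data`; decoding is left to the caller.
decoded = lines.map { |l| l.dup.force_encoding("UTF-16LE").encode("UTF-8") }

puts lines.size
puts lines.map(&:encoding).uniq.inspect
puts decoded.inspect
```

Listing "\r\n" first in the union matters: the alternation tries it before the single-character sequences, so a Windows line ending is consumed as one separator.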

brianmario · 2014-05-22

hot, I like that

Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. The errors did not appear when computing stats from git repositories, since git blobs are always read as ASCII-8BIT, which worked correctly. With FileBlob, however, the encoding could be ASCII-8BIT, UTF-8 or something else, depending on the runtime value of Encoding.default_external. Tests did not catch the error because they force Encoding.default_external to ASCII-8BIT (introduced in github-linguist#1211), so the error was only seen in real-world usage (see issue github-linguist#353). This commit forces ASCII-8BIT on File.read calls. The error is still present when using memory blobs with other encodings.
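The difference between a default read and a forced binary read can be sketched as follows. This is an illustration under assumptions, not the code from the referenced commit: the temporary file and the variable names are invented here, and `File.read` with an `encoding:` option is just one standard way to force ASCII-8BIT.

```ruby
require "tempfile"

# Hypothetical sample content: UTF-16LE bytes, held as a binary string.
bytes = "one\r\ntwo".encode("UTF-16LE").b

default_read, binary_read = Tempfile.create("utf16le-sample") do |f|
  f.binmode
  f.write(bytes)
  f.flush

  [
    # Tagged according to the process-wide default external encoding,
    # so the result varies with the environment.
    File.read(f.path),
    # Always tagged as ASCII-8BIT, independent of the environment.
    File.read(f.path, encoding: "ASCII-8BIT")
  ]
end

puts "default read: #{default_read.encoding}"
puts "forced read:  #{binary_read.encoding}"
```

With the forced read, the string carries the same encoding a git blob would, so the same line-splitting path is exercised regardless of the locale the process runs under.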

* Prepare 7.9.0 release
* Put back the v7.8.0 version

  We need this to ensure the versioning used during testing on GitHub.com doesn't cause caching problems in future
* fix errors on non-UTF-8 encodings

  Some files failed with "invalid byte sequence in UTF-8 (ArgumentError)" when BlobHelper#lines was called. Problematic files include UTF-16LE samples such as test/fixtures/Data/utf16le. The errors did not appear when computing stats from git repositories, since git blobs are always read as ASCII-8BIT, which worked correctly. With FileBlob, however, the encoding could be ASCII-8BIT, UTF-8 or something else, depending on the runtime value of Encoding.default_external. Tests did not catch the error because they force Encoding.default_external to ASCII-8BIT (introduced in #1211), so the error was only seen in real-world usage (see issue #353). This commit forces ASCII-8BIT on File.read calls. The error is still present when using memory blobs with other encodings.
* Decrease expected error count
* Set version to 7.9.0

Co-authored-by: Rick Winfrey
Co-authored-by: Santiago M. Mola

alindeman mentioned this pull request May 21, 2014

Bumping to 2.11.0 #1194

Merged

alindeman added 2 commits May 21, 2014 13:36

Counts the number of lines correctly for files with certain multibyte encodings

85efbde

Makes sure we do not fail if encoding == nil

185db0e

It looks like it's valid to call this method even if `binary?` is true. Encoding as 'ASCII-8BIT' should always succeed.

Takes a different approach

09a33f8

alindeman reviewed May 21, 2014

alindeman added a commit that referenced this pull request May 22, 2014

Merge pull request #1211 from alindeman/multibyte_line_count

6a192da

Counts the number of lines correctly for files with certain multibyte encodings

alindeman merged commit 6a192da into github-linguist:master May 22, 2014

alindeman deleted the multibyte_line_count branch May 22, 2014 15:27

smola mentioned this pull request Dec 1, 2019

fix errors on non-UTF-8 encodings #4730

Merged

github-linguist locked as resolved and limited conversation to collaborators Jun 18, 2024
